Text Classification with a Polysemy Considered Feature Set
Presented to the Graduate School of Engineering at the University of Tokyo for the Degree of Master in 2003
Abstract
As the volume of computerized text that we store and distribute grows, extracting useful data from it effectively has become an important issue. For this reason, techniques for classifying text automatically by computer have attracted attention. Text classification generally uses the Vector Space Model (VSM), which maps a document to a point in a multidimensional vector space whose axes correspond to a feature set of keywords that characterize the categories. Many past attempts at selecting such feature words have extracted the words that score highly on some measure, such as the mutual information between categories and words. However, some words are polysemous: they have multiple meanings rather than a single one. Documents containing a polysemous feature word may therefore belong to categories other than the intended one, which causes problems for classification. In our research, we regard polysemous words as features that pose a risk to classification. We propose a method that determines whether each feature word is such a risk factor, using mutual information as the measure for feature selection, and that disambiguates the feature set by removing the features judged to be risk factors. We compare the classification results of our method with those of an existing method, and evaluate its efficiency using the Reuters-21578 corpus as the target data.
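The mutual information measure mentioned in the abstract can be illustrated with a small sketch. It is not the thesis's method: the abstract does not specify how a feature word is judged to be a risk factor, so the code only scores a word against a category. The function name `mutual_information`, the toy corpus, and the representation of documents as sets of tokens are all assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(docs, labels, term, category):
    """Mutual information (in bits) between the events
    'term occurs in the document' and 'document belongs to category'.

    docs   -- list of token sets (hypothetical representation)
    labels -- list of category names, one per document
    """
    n = len(docs)
    # Count the four cells of the 2x2 contingency table.
    counts = Counter(
        (term in doc, label == category) for doc, label in zip(docs, labels)
    )
    mi = 0.0
    for t in (True, False):
        for c in (True, False):
            n_tc = counts[(t, c)]
            if n_tc == 0:
                continue  # the 0 * log(0) term is taken as 0
            n_t = counts[(t, True)] + counts[(t, False)]  # marginal for term
            n_c = counts[(True, c)] + counts[(False, c)]  # marginal for category
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi


# Toy corpus (invented): "bank" appears in both categories, "loan" in one.
docs = [
    {"bank", "loan"},
    {"bank", "river"},
    {"loan", "rate"},
    {"river", "fish"},
]
labels = ["finance", "nature", "finance", "nature"]

mi_loan = mutual_information(docs, labels, "loan", "finance")
mi_bank = mutual_information(docs, labels, "bank", "finance")
```

In this toy data, "loan" occurs only in finance documents and scores high, while the ambiguous word "bank" is spread evenly across both categories and scores zero. A real experiment would compute these scores over a corpus such as Reuters-21578 rather than four hand-written documents.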
Similar resources
A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier
With the fast increase of the documents, using Text Document Classification (TDC) methods has become a crucial matter. This paper presented a hybrid model of Invasive Weed Optimization (IWO) and Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS) in order to reduce the big size of features space in TDC. TDC includes different actions such as text processing, feature extraction, form...
A hybrid filter-based feature selection method via hesitant fuzzy and rough sets concepts
High dimensional microarray datasets are difficult to classify since they have many features with small number of instances and imbalanced distribution of classes. This paper proposes a filter-based feature selection method to improve the classification performance of microarray datasets by selecting the significant features. Combining the concepts of rough sets, weighted rough set, fuzzy rough se...
A Novel One Sided Feature Selection Method for Imbalanced Text Classification
The imbalance data can be seen in various areas such as text classification, credit card fraud detection, risk management, web page classification, image classification, medical diagnosis/monitoring, and biological data analysis. The classification algorithms have more tendencies to the large class and might even deal with the minority class data as the outlier data. The text data is one of t...
An Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification
Due to the exponential growth of electronic texts, their organization and management requires a tool to provide information and data in search of users in the shortest possible time. Thus, classification methods have become very important in recent years. In natural language processing and especially text processing, one of the most basic tasks is automatic text classification. Moreover, text ...
An Improved Flower Pollination Algorithm with AdaBoost Algorithm for Feature Selection in Text Documents Classification
In recent years, production of text documents has seen an exponential growth, which is the reason why their proper classification seems necessary for better access. One of the main problems of classifying text documents is working in high-dimensional feature space. Feature Selection (FS) is one of the ways to reduce the number of text attributes. So, working with a great bulk of the feature spa...
Publication date: 2009